enables ARM Thumb support #1122

Phosphorus15 · 2020-06-12T15:32:42Z

This PR presents a draft of a Core Theory/KB-based lifter for ARM Thumb instructions. It is mostly an incomplete skeleton that shows what the final lifter will look like.
Some key features are still missing, including:

Heap & Stack memories representation
Control flow & PC register (treated as a normal GPR for now) representation

Moreover, as issue #951 states, the ARM lifter and the Thumb lifter should eventually share the same state (more precisely, switch between the two). How to integrate this lifter with the old ARM lifter still remains a problem. @ivg, any idea how we can fix this?

ivg · 2020-06-12T16:16:52Z

Heap & Stack memories representation

Neither the heap nor the stack exists at the level of abstraction of an instruction lifter, so there is no need to model them.

Control flow & PC register (treated as a normal GPR for now) representation

The lifter has to hide the PC register, so that ld PC is represented as jmp (PC+n), where n is the ARM PC offset (4 or 8 bytes, IIRC).

... how to integrate this lifter with the old ARM lifter still remains a problem

In BAP 2.0 each program label (aka address) has its own architecture, therefore we need an analysis that will identify branches that switch the architecture. There are two caveats:

Right now we just assign all addresses the same architecture (the one in the binary header), so we need to update this code and let lifters override the default arch.
Architecture classification is undecidable at least in ARM/Thumb (a branch instruction that switches the mode could be unresolved/indirect) and the same two addresses can have both thumb and arm interpretation. In other words, it is an interesting task with many discoveries that are waiting for us.

XVilka · 2020-06-12T16:51:34Z

Architecture classification is undecidable at least in ARM/Thumb (a branch instruction that switches the mode could be unresolved/indirect) and the same two addresses can have both thumb and arm interpretation. In other words, it is an interesting task with many discoveries that are waiting for us.

This part can be solved with a superset assembler approach #944

ivg · 2020-06-12T17:07:42Z

Architecture classification is undecidable at least in ARM/Thumb (a branch instruction that switches the mode could be unresolved/indirect) and the same two addresses can have both thumb and arm interpretation. In other words, it is an interesting task with many discoveries that are waiting for us.

This part can be solved with a superset assembler approach #944

It is not needed, as the disassembler in BAP 2.x is already speculative and superset. It is driven by the knowledge base, so it may disassemble all possible substrings in all supported architectures at the same time. The main question is performance: in general, we don't want the full superset, even with invalid chains pruned (which our disassembler does automatically). So the question is how to find the right balance between precision and performance. We don't really want to double the CFG of each ARM binary.

Phosphorus15 · 2020-06-14T06:13:13Z

Heap & Stack memories representation

Neither the heap nor the stack exists at the level of abstraction of an instruction lifter, so there is no need to model them.

But we still need a linear memory representation for instructions like LDR or PUSH. Maybe simply model it using Theory.Mem?

Control flow & PC register (treated as a normal GPR for now) representation

The lifter has to hide the PC register, so that ld PC is represented as jmp (PC+n), where n is the ARM PC offset (4 or 8 bytes, IIRC).

... how to integrate this lifter with the old ARM lifter still remains a problem

In BAP 2.0 each program label (aka address) has its own architecture, therefore we need an analysis that will identify branches that switch the architecture.

Is the lifter responsible for linking the program labels to instructions? For example, in the bytoy lifter there is

let block seq data ctrl =
    Theory.Label.for_addr (Word.int seq) >>= fun label ->
    blk label data ctrl

which is called after every single instruction, with the current PC provided.

Phosphorus15 · 2020-06-14T06:33:41Z

We don't really want to double the CFG of each ARM binary.

Some of this information can be determined statically, though. The ARM ELF ABI docs say:

5.5.3 Symbol Values

In addition to the normal rules for symbol values the following rules shall also apply to symbols of type STT_FUNC:

If the symbol addresses an Arm instruction, its value is the address of the instruction (in a relocatable object, the offset of the instruction from the start of the section containing it).

If the symbol addresses a Thumb instruction, its value is the address of the instruction with bit zero set (in a relocatable object, the section offset with bit zero set).
For the purposes of relocation the value used shall be the address of the instruction (st_value & ~1).

Note: This allows a linker to distinguish Arm and Thumb code symbols without having to refer to the map. An Arm symbol will always have an even value, while a Thumb symbol will always have an odd value. However, a linker should strip the discriminating bit from the value before using it for relocation.

This could be defined as ARM-only knowledge provided by the binary file (ELF etc.) loader. Still, a malicious program could switch the T flag arbitrarily, and we might not have a well-formed binary at all, so this static information is not entirely sufficient.
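A minimal sketch of the symbol-value rule quoted above, in plain OCaml on integers. The names `symbol_isa`, `symbol_address`, and `classify_symbol` are hypothetical and are not part of BAP or of any loader API; the rule applies only to symbols of type STT_FUNC.

```ocaml
(* Hypothetical helpers illustrating the ABI rule for STT_FUNC symbols:
   bit zero of st_value marks Thumb code and must be stripped to get
   the real address (st_value & ~1). *)
type isa = Arm | Thumb

(* Odd symbol value: Thumb; even symbol value: Arm. *)
let symbol_isa st_value =
  if st_value land 1 = 1 then Thumb else Arm

(* The instruction address with the discriminating bit cleared. *)
let symbol_address st_value = st_value land (lnot 1)

(* Both pieces of static knowledge a loader could provide for a symbol. *)
let classify_symbol st_value = (symbol_isa st_value, symbol_address st_value)
```

For instance, `classify_symbol 0x8001` yields a Thumb entry at address `0x8000`. As noted above, this only seeds the classification; it cannot rule out later mode switches.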

ivg · 2020-06-15T15:11:13Z

But we still need a linear memory representation for instructions like LDR or PUSH. Maybe simply model it using Theory.Mem?

Yes, machine instructions are fully self-contained (unlike bytecode instructions, which sometimes need extra modeling because they are evaluated by a VM, not a CPU). Whenever you see a load or push instruction, its operands will be fully defined.

Is the lifter responsible for linking the program labels to instructions?

No, they will be linked by the IR lifter.

ivg · 2020-06-15T22:12:36Z

Note: This allows a linker to distinguish Arm and Thumb code symbols without having to refer to the map. An Arm symbol will always have an even value, while a Thumb symbol will always have an odd value. However, a linker should strip the discriminating bit from the value before using it for relocation.

It is only relevant to the linker and to how symbols are encoded in the symbol table (in this particular ABI). The mode can be switched on any jump (one that doesn't involve the symbol table), and both ARM and Thumb instructions can have even addresses (in fact, they must have even addresses due to alignment requirements).
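To make the point about jumps concrete, here is a small sketch of how an interworking branch such as BX picks the instruction set from the register value at run time, independently of any symbol table. The names `state` and `bx_target` are hypothetical; this models only the bit test, not a lifter.

```ocaml
(* Hypothetical model of an interworking branch (e.g. BX Rm):
   the low bit of the register value selects the instruction set,
   and the branch target itself is always an even address. *)
type state = Arm_state | Thumb_state

let bx_target value =
  if value land 1 = 1 then
    (* Bit 0 set: continue in Thumb state at the address with bit 0 cleared. *)
    Some (Thumb_state, value land (lnot 1))
  else if value land 2 = 0 then
    (* Low two bits clear: a word-aligned ARM-state target. *)
    Some (Arm_state, value)
  else
    (* Bit 0 clear but bit 1 set: not a word-aligned ARM target. *)
    None
```

Since `value` may come from memory or from arithmetic, a static analysis cannot always know which branch of this function is taken.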

XVilka

It will require some tests as well, see e.g.:

XVilka · 2020-06-16T05:04:38Z

plugins/arm_thumb/thumb_flags.ml

+
+end
+
+(*


You can safely remove this.

XVilka · 2020-06-16T05:05:21Z

plugins/arm_thumb/thumb_mem.ml

+                        | _ -> raise (Lift_Error "`src` must be a register")
+                    )
+        | _ -> raise (Lift_Error "`dest` must be a register")
+    (* the `R` bit is automatically resolved *)


Please separate with a new line here and in the following code.

Phosphorus15 · 2020-06-16T06:33:15Z

Note: This allows a linker to distinguish Arm and Thumb code symbols without having to refer to the map. An Arm symbol will always have an even value, while a Thumb symbol will always have an odd value. However, a linker should strip the discriminating bit from the value before using it for relocation.

It is only relevant to the linker and to how symbols are encoded in the symbol table (in this particular ABI). The mode can be switched on any jump (one that doesn't involve the symbol table), and both ARM and Thumb instructions can have even addresses (in fact, they must have even addresses due to alignment requirements).

But it at least provides a way to initially determine the instruction set of a symbol (under a certain ABI).

By the way, how would you suggest representing the PC? Obviously it should be a concrete value like Bitvec rather than an abstract Theory.Var, for the sake of addressing and labeling, but I guess it is not correct to make the lifter itself carry a concrete state value?

ivg · 2020-06-25T15:46:57Z

By the way, how would you suggest representing the PC? Obviously it should be a concrete value like Bitvec rather than an abstract Theory.Var, for the sake of addressing and labeling, but I guess it is not correct to make the lifter itself carry a concrete state value?

Your guess is absolutely correct. Yes, the address of the lifted instruction is a static constant (for the target language) and is a parameter (of type Bitvec.t) for the meta language.

When you define the semantics for an instruction you build a value of type unit eff which is defined as

  type 'a eff = 'a effect knowledge

And the lifter itself is a function of type Theory.label -> unit effect knowledge, where unit effect is also known in Bap.Std as type insn, and Theory.label is known in Bap.Std as tid, or term identifier. So, in the parlance of Bap.Std, the lifter is a function of type tid -> insn knowledge that returns a knowledge computation of the instruction semantics. That means that we can use the tid = Theory.label = program obj to obtain any information about the program identified by this tid (including the semantics itself; the function could be recursive). A program has a lot of properties; we can enumerate them with bap list classes -f core-theory:program, which will output something like this:

    - bap.std:common-name        a unique name associated with the program
    - bap.std:insn               a decoded machine instruction
    - bap.std:mem                a memory region occupied by the program
    - bap.std:arch               an ISA of the program
    - core-theory:semantics      the program semantics
    - core-theory:label-aliases  the set of known program names
    - core-theory:label-ivec     the program interrupt vector
    - core-theory:label-name     the program linkage name
    - core-theory:label-addr     the program virtual address
    - core-theory:is-subroutine  is the program a subroutine entry point
    - core-theory:is-valid       is the program valid or not

Our task is to provide a value for the core-theory:semantics property (which, reflected in OCaml, has type unit effect). To fulfill this task, we can query the knowledge base for any other property. E.g., we can get bap.std:insn, the machine code representation (provided by the LLVM decoder), to get the decoding of the memory chunk; the memory chunk itself is also accessible through bap.std:mem. We can get the address using core-theory:label-addr, if the chunk of memory has an address. The provided label serves as the database key, e.g.,

(* assumes [open Bap.Std], [open Bap_core_theory] and [open KB.Syntax];
   build_the_semantics_object is a placeholder *)
let lifter label =
  KB.collect Theory.Label.addr label >>= fun addr -> (* the address of the current instruction, if known *)
  KB.collect Disasm_expert.Basic.Insn.slot label >>= fun insn -> (* the LLVM-provided decoding *)
  KB.collect Memory.slot label >>= fun mem -> (* the memory chunk, probably not needed *)
  build_the_semantics_object addr insn

Basically, you have the full access to the knowledge base in the lifter.
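As a loose illustration of the "label as database key" idea, here is a toy model in plain OCaml. It deliberately does not use BAP: the `kb` table, the `props` record, and `collect` are hypothetical stand-ins for the knowledge base, the program properties, and KB.collect, and the semantics are rendered as a string only to keep the sketch self-contained.

```ocaml
(* Toy model only: BAP's real knowledge base is monadic and far richer. *)
type label = int                 (* stands in for Theory.label *)

type props = {
  addr : int option;             (* plays the role of core-theory:label-addr *)
  insn : string option;          (* plays the role of bap.std:insn *)
}

(* The "knowledge base": properties stored per label. *)
let kb : (label, props) Hashtbl.t = Hashtbl.create 16

(* Look up one property of the program identified by [label]. *)
let collect field label =
  match Hashtbl.find_opt kb label with
  | None -> None
  | Some props -> field props

(* The lifter maps a label to the semantics of the program it identifies,
   querying whatever other properties it needs along the way. *)
let lifter label =
  match collect (fun p -> p.addr) label, collect (fun p -> p.insn) label with
  | Some addr, Some insn -> Some (Printf.sprintf "0x%x: %s" addr insn)
  | _ -> None                    (* not enough knowledge to build semantics *)

(* Populate two labels: one fully known, one without an address. *)
let () =
  Hashtbl.replace kb 1 { addr = Some 0x8000; insn = Some "movs r0, #1" };
  Hashtbl.replace kb 2 { addr = None; insn = Some "bx lr" }
```

Here `lifter 1` produces a result, while `lifter 2` and a lookup of an unknown label produce nothing, mirroring a lifter that provides semantics only when the required properties are available.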

Besides, as a side note, in some architectures the value of the PC register is not equal to the address of the current instruction; sometimes it is shifted by some number of bytes (so it points ahead of the instruction). In ARM it is 4 or 8 bytes, I don't remember. Also, LLVM may mean by PC either the actual value of the PC register or the current instruction address. So keep this in mind.
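For reference, the architectural rule is that reading PC yields the address of the current instruction plus 8 in ARM state and plus 4 in Thumb state, and Thumb PC-relative loads first round that value down to a multiple of 4. The sketch below encodes this with plain integers; `pc_read`, `align4`, and `thumb_literal_address` are hypothetical names, and 64-bit OCaml integers are assumed for the 32-bit mask.

```ocaml
(* Hypothetical helpers: the PC value as a static function of the
   instruction address, so the lifter never needs a PC variable. *)
type mode = Arm | Thumb

(* Bytes by which a PC read runs ahead of the instruction's own address. *)
let pc_offset = function
  | Arm -> 8
  | Thumb -> 4

(* Value observed when the instruction at [addr] reads PC (32-bit wrap;
   assumes 64-bit OCaml integers). *)
let pc_read mode addr = (addr + pc_offset mode) land 0xFFFF_FFFF

(* Round down to a multiple of 4. *)
let align4 x = x land (lnot 3)

(* Address accessed by a Thumb PC-relative load with immediate [imm]. *)
let thumb_literal_address addr imm =
  (align4 (pc_read Thumb addr) + imm) land 0xFFFF_FFFF
```

Because `addr` is known when the instruction is lifted, every use of PC as an operand can be replaced by the constant that `pc_read` computes.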

You can also follow our discussion in the AArch64 lifter PR (#1141); I think everything that we discuss there is applicable to this lifter as well. We may even end up with some code sharing.

And if you have any questions, please don't hesitate to ask.

Phosphorus15 · 2020-07-14T10:13:54Z

This brand-new Thumb lifter has been updated to fit the structure of #1174, and it should be fully functional on its own after proper testing.

ivg · 2020-07-16T13:56:27Z

okay, let's close it, but keep in mind the discussions that have happened here.

ivg marked this pull request as draft June 12, 2020 20:21

ivg changed the title ~~[WIP] ARM Thumb support~~ enables ARM Thumb support Jun 12, 2020

ivg added the arm-lifter label Jun 12, 2020

XVilka suggested changes Jun 16, 2020


ivg mentioned this pull request Jun 16, 2020

failure on elf32-littlearm #1133

Closed

Phosphorus15 mentioned this pull request Jul 10, 2020

Core Theory based ARM lifter #1174

Closed

Phosphorus15 added 14 commits July 16, 2020 00:04

adds radare2 symbolizer

a115deb

Fixes errors within oasis file

fcdf8ee

fixes silly mistakes

0a3f07c

use 'isj' to dump symbols

b8e8ce1

thumb lifter draft

2c6dfd3

fix typo

13ae335

chop 'sym.imp' prefix

80f8dc8

extends register set

c03508a

complete & extract move-like instructions

3bb5fab

single memory l/s instructions

246dcae

multiple l/s instructions

d64507f

complete all bitwise move-like instructions

00b2fcd

finished instructions def

7aa7b65

code style clean-up

434c772

Phosphorus15 added 11 commits July 16, 2020 00:04

refactor for new radare2-ocaml interface

c5a6c1b

lift symbol extracting to allow partial failure

612f15a

clean-up refactor & indentation fix

111df62

polymorphic variant & pipe operator fix

1325ade

apply ocp-indent

f27131b

thumb plugin refinement

e99afb9

.merlin file update

640c80a

ocp-indent rerun

6f66029

address assigned to LR should be subtracted

8e019a0

register KB semantics promise

a8cd86b

now its up and working

b48931b

Phosphorus15 mentioned this pull request Jul 15, 2020

enables ARM Thumb/Thumb2 and interworking #1178

Merged

Phosphorus15 force-pushed the thumb branch from b5c4006 to b48931b Compare July 15, 2020 16:47

ivg closed this Jul 16, 2020
